Analysis of experimental data must sometimes deal with abrupt changes in the distribution of
measured values. Setting upper limits on signals usually involves a veto procedure that excludes
data not described by an assumed statistical model. We show how to implement statistical estimates
of physical quantities (such as upper limits) that are valid without assuming a particular family of
statistical distributions, while still providing close to optimal values when the data is from an
expected distribution (such as Gaussian or exponential). This new technique can compute statistically
sound results in the presence of severe non-Gaussian noise, relaxes assumptions on distribution
stationarity and is especially useful in automated analysis of large datasets, where computational
speed is important.
INTRODUCTION



Data collected in experiments is sometimes contaminated
by noise with an ill-behaved and often unknown
distribution, presenting problems for the traditional
method of using distribution quantiles to establish upper
limits or confidence intervals. This problem happens
especially often in experiments that collect large volumes
of data.

A common solution is to exclude contaminated data
from the analysis. For example, figure 1 shows a small
portion of data obtained in the LIGO search for continuous
gravitational waves in the fifth science run. The blue
points mark regions where non-Gaussian behaviour was
detected and upper limit values are not expected to be
valid.

In fact, if one looks carefully at the data for each point
one can find a cause of non-Gaussian behaviour and a
workaround to establish an upper limit - but the causes
are different for different points, making analysis very
laborious.

What is desired is an automated way to establish an
upper limit that would be correct (if a bit conservative)
for an arbitrary distribution, while still being close to
optimum in the case of Gaussian noise (or another distribution
class) that commonly occurs in the data.

We present a new algorithm that establishes upper limits
without assuming a specific underlying background
distribution, and that can be optimized for an arbitrary
class of distributions (such as Gaussian, exponential, etc.)
that are expected to commonly occur in the data.

This advance allows one to obtain valid upper limits on
signal strengths in the presence of ill-behaved and poorly
understood background.
FIG. 1. Upper limit data from LIGO S5 search for continuous
gravitational waves in the 50-200 Hz frequency range [1]. A
large number of non-Gaussian bands significantly reduces the
usefulness of the results. (color online)



UNIVERSAL INEQUALITIES AND STATISTIC 

Let us consider a sample problem. Suppose we have
obtained many samples of data, each consisting of background
noise plus a possible signal. The data is collected
in batches of $N$ samples $x_i$ for which the background
noise $\xi_i$ is independent and identically distributed. Also,
we expect that at most one sample $j$ in each batch contains
a signal:

$$x_i = \xi_i + s\,\delta_{ij} \qquad (1)$$
We would like to place a limit on the strength of the
signals that may (or may not) be present in our data set.

If we knew that the noise $\xi_i$ were drawn from a particular
distribution $p$ (such as Gaussian) our task would be
straightforward - we would find the maximum $x_i$ in all
our data, subtract the mean of the background $\mu$ and add
a distribution-specific correction $C_p$ that accounts for the
possibility that a particular sample with the signal was
below the background mean:

$$UL_p = \max_i x_i - \mu + C_p \qquad (2)$$
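As an illustration of equation (2), consider Gaussian noise with known mean and standard deviation. The text does not spell out the correction, so the choice below is an assumption: the correction is taken to be sigma times the standard normal quantile at 1 - eps, because the noise in the signal-bearing sample then falls further below the mean than the correction only with probability eps. The function name and its arguments are hypothetical.

```python
from statistics import NormalDist

def gaussian_upper_limit(batch, mu, sigma, eps=0.05):
    """Upper limit in the form of equation (2) for Gaussian noise.

    mu and sigma are the known mean and standard deviation of the noise.
    Assumption of this sketch: C_p = sigma * z, where z is the (1 - eps)
    quantile of the standard normal distribution.
    """
    c_p = sigma * NormalDist().inv_cdf(1.0 - eps)
    return max(batch) - mu + c_p

# Three samples of noise with mu = 0 and sigma = 1
print(gaussian_upper_limit([-0.3, 1.2, 0.4], mu=0.0, sigma=1.0))
```

When the distribution is not known, neither the quantile nor sigma is available in this form, which is the situation the rest of the section addresses.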



If the distribution of $\xi_i$ is not known with certainty,
then we can try to estimate the distribution from the data
itself. This, however, is problematic when the amount of
data is small.

We now notice that, regardless of the procedure used to
compute the correction, it ends up being a function
of the input data:

$$UL = \max_i x_i - \mu + C(\{x_i\}) \qquad (3)$$



We can now pose the following problem:
Suppose we are given a confidence level $1 - \epsilon$, a class of
commonly encountered distributions $\mathcal{D}$ and a tolerance
$\alpha$. We need to find a function $C(\{x_i\})$ of the input data
such that:

1. For any $s$ and any distribution of $\xi_i$ we have

$$P(UL < s) < \epsilon \qquad (4)$$

2. We require that for any distribution $p \in \mathcal{D}$ the upper
limits are overestimated by at most $\alpha$ compared
to what we could obtain with full knowledge of the
distribution:

$$\frac{UL}{UL_p} < 1 + \alpha \qquad (5)$$



We call such statistics universal as they are applicable 
regardless of the distribution of noise we have. 

DERIVATION OF UPPER LIMIT STATISTIC 

In probability theory, distribution-independent bounds
are commonly obtained by use of the Chebyshev-Bienaymé
or Markov inequalities; however, they are rarely used
in practice, since in common applications they provide
bounds that are far too loose.

For example, Encyclopaedia Britannica writes "Unfortunately,
with virtually no restriction on the shape of an
underlying distribution, the inequality is so weak as to
be virtually useless to anyone looking for a precise statement
on the probability of a large deviation. To achieve
this goal, people usually try to justify a specific error
distribution, such as the normal distribution..." [2].



This is because even though the Chebyshev-Bienaymé and
Markov inequalities are sharp - turning into equalities
for an appropriate probability distribution - these distributions
are rarely encountered in practice.

There exists a stronger Vysochanskij-Petunin inequality
[3] but it relies on distributions being unimodal - an
assumption that is hard to establish in empirical data. A
review of other Chebyshev-type inequalities can be found
in [4, 5].

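We engineer an upper limit statistic by starting with
Markov's inequality

$$P(|X| \ge a) \le \frac{E|X|}{a} \qquad (6)$$

and modifying it to read

$$P(|f(X)| \ge a) \le \frac{E|f(X)|}{a} \qquad (7)$$

Then a further modification yields:

$$P\left(\left|f\left(\frac{X-\mu}{\sigma}\right)\right| \ge a\right) \le \frac{1}{a}\, E\left|f\left(\frac{X-\mu}{\sigma}\right)\right| \qquad (8)$$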



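for, in general, arbitrary $\mu$ and $\sigma > 0$ - though in practice
these are chosen to be estimates of the mean and
standard deviation. After setting

$$a = \frac{1}{\epsilon}\, E\left|f\left(\frac{X-\mu}{\sigma}\right)\right| \qquad (9)$$

we obtain

$$P\left(\left|f\left(\frac{X-\mu}{\sigma}\right)\right| \ge \frac{1}{\epsilon}\, E\left|f\left(\frac{X-\mu}{\sigma}\right)\right|\right) \le \epsilon \qquad (10)$$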



Because the original Markov's inequality is correct for
a random variable $X$ with an arbitrary distribution,
inequality (10) is valid for any choice of $f(x)$, $\mu$ and $\sigma$ -
even when $\mu$ and $\sigma$ are estimated from the data $X$.
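The distribution independence of inequality (10) can be checked numerically. The sketch below replaces the expectation by a sample average and estimates the mean and standard deviation from the same sample; for the empirical distribution Markov's inequality then holds exactly, so the printed fractions cannot exceed eps. The function name, the two test functions and the three noise models are illustrative choices, not taken from the text.

```python
import random
from statistics import fmean, pstdev

def exceed_fraction(sample, f, eps):
    """Fraction of sample points for which the event in inequality (10) occurs.

    The mean and standard deviation are estimated from the sample itself and
    the expectation is replaced by the sample average.
    """
    mu = fmean(sample)
    sigma = pstdev(sample)
    g = [abs(f((x - mu) / sigma)) for x in sample]
    threshold = fmean(g) / eps  # the choice of a made in equation (9)
    if threshold == 0.0:
        raise ValueError("f vanishes on the whole sample; Markov's inequality needs a > 0")
    return sum(1 for v in g if v >= threshold) / len(g)

rng = random.Random(1)
samples = {
    "gaussian": [rng.gauss(0.0, 1.0) for _ in range(5000)],
    "exponential": [rng.expovariate(1.0) for _ in range(5000)],
    "pareto (heavy tail)": [rng.paretovariate(1.5) for _ in range(5000)],
}
functions = {
    "ramp": lambda x: max(x, 0.0),
    "square": lambda x: x * x,
}
for dist_name, sample in samples.items():
    for f_name, f in functions.items():
        frac = exceed_fraction(sample, f, eps=0.05)
        print(f"{dist_name:20s} {f_name:6s} {frac:.4f}")  # bounded by eps = 0.05
```

How close the fraction comes to eps depends on the choice of f, which motivates optimizing f for the expected class of distributions.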

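We can now optimize $f(x)$ to provide more precise upper
limits or confidence intervals for our desired distribution.
As a quick example, the inequality (10) becomes
sharp for a Gaussian random variable $X$ when we choose
$\mu = EX$, $\sigma = \sqrt{\mathrm{Var}\,X}$ and use a step function

$$f_s(x) = \begin{cases} 1 & \text{when } x > x_\epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

where $x_\epsilon$ satisfies

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x_\epsilon} e^{-x^2/2}\,dx = 1 - \epsilon \qquad (12)$$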
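Equation (12) defines the point of the step as the standard normal quantile at 1 - eps, which can be evaluated directly. The short sketch below (function names are illustrative) computes it with Python's standard library and confirms the sharpness claim: the expectation of the step function is eps, so the threshold in (10) equals 1 and the bound is attained.

```python
from statistics import NormalDist

def gaussian_x_eps(eps):
    """Solve equation (12): the standard normal CDF at x_eps equals 1 - eps."""
    return NormalDist().inv_cdf(1.0 - eps)

def step_expectation(eps):
    """E f_s(Z) for a standard normal Z, i.e. P(Z > x_eps)."""
    return 1.0 - NormalDist().cdf(gaussian_x_eps(eps))

for eps in (0.10, 0.05):
    x_e = gaussian_x_eps(eps)
    # E f_s = eps, so the threshold E f_s / eps in (10) equals 1 and the
    # event f_s >= 1 has probability eps: the inequality becomes an equality.
    print(eps, round(x_e, 4), round(step_expectation(eps) / eps, 6))
```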



The choice $f(x) = f_s(x)$ is difficult to apply to establish
a confidence interval because the function $f_s(x)$ is not
invertible: it can happen that the average of $\frac{1}{\epsilon} f_s\left(\frac{X-\mu}{\sigma}\right)$
for practical data is greater than 1, which does not yield
a constraint on $X$. One approach could be to pick an initial
$x_\epsilon$ as defined by equation (12) and then iterate to establish
a bound for $X$. This is cumbersome for both analytical
and numerical computation.

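A better way is to use an $f(x)$ that is invertible above
$x_\epsilon$. An especially simple and computationally efficient
example, shown in figure 2, is given by

$$f_c(x) = \begin{cases} 1 + \frac{1}{2}(x - x_\epsilon) & \text{when } x > x_\epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

with the corresponding inverse function given by

$$f_c^{-1}(x) = \begin{cases} x_\epsilon + 2(x - 1) & \text{when } x > 1 \\ x_\epsilon & \text{otherwise} \end{cases} \qquad (14)$$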
FIG. 2. Function $f_c$ used for computation of 95% confidence
level upper limits (simulation results are shown in figure 4).
The point $x_\epsilon$ has been shifted to the right to compensate for
errors in mean estimates for 501 points of input data. The
dashed line shows the step function $f_s$ that makes inequality
(10) exact in the ideal case. (color online)



Our correction $C$ is then

$$C = \sigma f^{-1}\left(\frac{1}{\epsilon}\, E\left|f\left(\frac{X-\mu}{\sigma}\right)\right|\right) \qquad (15)$$

with expectation replaced by average for empirical data.



PERFORMANCE OF UNIVERSAL UPPER LIMIT STATISTIC

A step-by-step algorithm for computing the upper limit
is shown in figure 3. It incorporates two adjustments
that we find important in practical implementation.

First, the point $x_\epsilon$ has been increased by $5/\sqrt{N}$ to
compensate for possible errors in estimation of the mean $\mu$
which could lead to effectively underestimating $x_\epsilon$.

Secondly, the standard deviation $\sigma$ is estimated using
data from the lower tail of the distribution only, as
appropriate for establishing an upper limit.

The main steps 2-6 of the algorithm in figure 3 employ only
piecewise linear functions, allowing for very efficient
implementation on virtually any computational platform.



1. Prepare by computing the value of $x_\epsilon = F^{-1}(1-\epsilon) + 5/\sqrt{N}$.

2. Compute $M = \max_{i=1..N} x_i$.

3. Compute $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.

4. Compute $\sigma \propto -\sum_{i=1}^{N} \min(x_i - \mu, 0)$. We prefer this
formula as it is simpler to compute and is insensitive
to outliers in the upper tail of the distribution.

5. Compute $S = \frac{1}{N}\sum_{i=1}^{N} f_c\left(\frac{x_i-\mu}{\sigma}\right)$.

6. Establish upper limit $UL$.

FIG. 3. Algorithm for computing the upper limit from a single
batch of data of N points.
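The steps of figure 3 can be sketched in a few lines of Python. Several details are assumptions of this sketch rather than statements of the text: F in step 1 is taken to be the standard normal distribution function (the text says the point was obtained assuming Gaussian data); the scale in step 4 is normalized by sqrt(2*pi) so that it equals the standard deviation for Gaussian data; the function f_c is evaluated on deviations below the mean, (mu - x_i)/sigma, so that the bound controls how far the noise in the signal-bearing sample can fall below the mean; and the inverse returns the step point for arguments not exceeding 1. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

def f_c(x, x_e):
    """Equation (13): zero up to x_e, then a line of slope 1/2 starting at 1."""
    return 1.0 + 0.5 * (x - x_e) if x > x_e else 0.0

def f_c_inv(y, x_e):
    """Equation (14); arguments y <= 1 map to x_e (assumed 'otherwise' branch)."""
    return x_e + 2.0 * (y - 1.0) if y > 1.0 else x_e

def universal_upper_limit(batch, eps=0.05):
    n = len(batch)
    # Step 1: Gaussian quantile shifted to the right by 5/sqrt(N).
    x_e = NormalDist().inv_cdf(1.0 - eps) + 5.0 / math.sqrt(n)
    # Steps 2-3: maximum and mean of the batch.
    m = max(batch)
    mu = fmean(batch)
    # Step 4: scale from the lower tail only. ASSUMPTION: the factor
    # sqrt(2*pi) makes this equal to the standard deviation for Gaussian data.
    sigma = -math.sqrt(2.0 * math.pi) * fmean([min(x - mu, 0.0) for x in batch])
    # Step 5: batch average of f_c. ASSUMPTION: evaluated on deviations
    # below the mean, (mu - x) / sigma.
    s = fmean([f_c((mu - x) / sigma, x_e) for x in batch])
    # Step 6: equation (3) with the correction of equation (15).
    return m - mu + sigma * f_c_inv(s / eps, x_e)

# Gaussian noise with a signal injected into one sample of every batch.
rng = random.Random(2024)
signal = 10.0
trials = 400
misses = 0
for _ in range(trials):
    batch = [rng.gauss(0.0, 1.0) for _ in range(501)]
    batch[0] += signal
    if universal_upper_limit(batch) < signal:
        misses += 1
print(f"fraction of batches with UL < s: {misses / trials:.3f}")
```

The sketch does not guard against a batch of identical values, for which the scale estimate is zero. Every operation after step 1 is a comparison, a sum or a linear map, which is what makes the procedure inexpensive for large numbers of batches.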

To gauge the performance of our new statistic, we performed
a simulation that closely reflects real-world situations
we encountered in data analysis [1]. We assume that
our data consists of independent samples of noise plus a
possible deterministic signal in one or more bins. The
data is analyzed in batches of 501 data samples, for each
of which we establish an upper limit on signal strength.
The final reported value is the worst case (i.e. maximum)
upper limit among 100 batches.

The comparison of this worst case upper limit to the
upper limit established analytically from the known
distribution of underlying noise is shown in figure 4. The
samples consisted of identically distributed pure noise
($s = 0$).

We have averaged the ratios of established upper limits
to theoretical ideal values. Because the value of $x_\epsilon$ was
obtained assuming Gaussian data, our statistic achieves
less than 5% overestimate for Gaussian noise, both for 90% and
95% confidence level upper limits.

A number of other distributions have been tried. As
seen on the plot, the performance is remarkably flat for
$\chi^2$ distributions with different degrees of freedom and the
overestimate is moderate for the uniform distribution. The
heavy-tailed Student's t-distributions, as well as the lognormal
distribution, show good performance as well. Even
in the extreme case of the Bernoulli distribution, with equal
probability to obtain 0 and 1, our overestimate is less
than 50% for 95% confidence level. Finally, the custom
distribution test1, composed of three populations of normal
and exponentially distributed numbers (figure 5), has an
overestimate of only 20% for 95% confidence level.

FIG. 4. Average overestimate of upper limit by the universal
statistic as compared to the value predicted by the analytical
formula for the corresponding distributions. The overestimate
is less than 5% for Gaussian data, and we expect any practical
measurement to perform worse than the ideal case. The
upper limits were computed using $f_c$ (see figure 2) with 501
points of data for different noise distributions: exponential,
$\chi^2$ with different degrees of freedom, Gaussian, Student's
t-distribution with different degrees of freedom, lognormal,
uniform, Bernoulli with equal probability to obtain each outcome
and a custom distribution test1, the histogram for which is
shown in figure 5. The points on the graph were obtained
by averaging 100 independent measurements, each of which
consisted of finding the maximum among 100 upper limits
to simulate maximization across a set of templates. (color
online)



FIG. 5. Distribution test1 used in figure 4. It is composed of
three populations, two normal and one exponential. We also
show distribution-specific and universal 95% confidence level
upper limits for this batch of 501 numbers. (color online)
CONCLUSIONS



We have described a new universal statistic that produces
reliable and useful upper limits regardless of the
underlying distribution of noise, while still producing
close to optimum values for a specific family of distributions.
The algorithm for computing its values is very
practical, and is easily implemented for large scale computation.

This opens the road for publication of reliable results
from large data sets with only partial understanding of
distributional properties of data they contain.